1 Introduction and set-up

This project is intended to be read as HTML, so open the HTML file in your preferred web browser. By default the code is hidden in this document, but you can reveal all of it by pressing the “Code” button in the top right, or show individual chunks with the “Code” buttons placed throughout the document.

Link for google colab: https://colab.research.google.com/drive/1SK3M2bfXmPKTQXr4utg8nCO4s6JPRWMw

Link for github: https://github.com/DataEconomistDK/M2-Group-Assignment

In this project we will work with a dataset of 5,000 consumer reviews of a few Amazon electronic products, such as the Kindle. The data was collected between September 2017 and October 2018. It is a sample taken from Kaggle and is part of a much larger dataset available through Datafiniti. The data can be downloaded from this link: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products?fbclid=IwAR1o_blPfHeBPmnUzAOW7Ct24L7fhbI3OGcbfaVgaDZENhVXwaCP4godKvQ#Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv

Note that there are three datasets available on Kaggle, but the file used here is called “Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products”. The file is downloaded as is and imported further below.

1.1 Loading packages

First I have some personal setup in my local R Markdown for how I want to display warnings etc. Then I load my packages.

1.2 Loading and filtering data

Now we load the data we downloaded from Kaggle. From this file we select the following variables:

  • id: An id number for each review, created by us, corresponding to the row number of the raw data.

  • name: The full name of the product.

  • reviews.rating: The rating of the product on a scale from 1 to 5.

  • reviews.title: The title of the review, given by the customer.

  • reviews.text: The review text written by the customer.

data_raw <- read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv") %>% 
  select(name, reviews.rating, reviews.text, reviews.title) %>% 
  mutate(id = row_number())

As the data is very raw and messy, we now do some cleaning. In the review text we remove everything that is not a normal letter, such as punctuation, numbers and special characters, and change all strings to lower case.

We will also do some lemmatization. The purpose is to avoid analyzing only the exact word strings in the reviews, as these would include several possible forms of the same word, e.g. think and thought. Instead we want to merge all forms of a word into its root word. Lemmatization attempts to do so by using detailed dictionaries, which the algorithm looks through to link a given word string back to its root word. This is a more advanced method than stemming and should be beneficial in this report.

We primarily want to work with tidy text, where there is one token per row. So a new, clean and filtered dataset is created, both as tokens and as a normal dataframe with the review text.

tokens_clean <- data_raw %>% 
  unnest_tokens(word, reviews.text, to_lower = TRUE) %>% 
  mutate(word = word %>% str_remove_all("[^a-zA-Z]")) %>%
  filter(str_length(word) > 0) %>% 
  mutate(word = lemmatize_words(word))

reviewtext_lemma <- tokens_clean %>% 
  group_by(id) %>% 
  summarize(reviews.text = str_c(word, collapse = " ")) %>% 
  ungroup() %>% 
  select(reviews.text) %>% 
  as_vector()

data_clean <- data_raw %>%
  mutate(reviews.text = reviewtext_lemma)

We now have 153,994 tokens, each in its own row in the tokens dataset. Through lemmatization the number of unique tokens is reduced from around 6,000 to around 4,600 words, which should prove quite beneficial.

2 Network analysis

In this assignment we want to use network analysis to gain new insights into how the reviews are structured. We extract bigrams from each review text, then clean and prepare them to create networks. Where we before considered tokens as individual words, we can instead create them as n-grams, which are consecutive sequences of words. Bigrams are n-grams of length 2. This can be used to capture context and connections between words.

Bigrams are now created, by unnesting the tokens.

bigrams <- data_clean %>%
  unnest_tokens(bigram, reviews.text, token = "ngrams", n = 2) # n is the number of words to consider in each n-gram. 

bigrams$bigram[1:2]
## [1] "the display" "display be"

Remember that the bigrams overlap, as can be seen above: the first token is “the display” and the second is “display be”. Now the most common bigrams are displayed.

#Counting common bigrams
bigrams %>% 
  count(bigram, sort = TRUE)

Notice the most common bigrams are: “for my”, “easy to”, “to use”, “it is”. These mostly consist of stopwords, which are not very useful for the analysis. To remove these, we split each bigram into two columns, word1 and word2, and then filter out rows where either word is a stopword. The stopwords are taken from the stop_words dictionary. We then make a new count to see the most common bigrams after filtering.

bigrams_separated <- bigrams %>% 
  separate(bigram,c("word1","word2"),sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

#New bigram counts
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

Above we can see that the most common bigrams are now mostly product terms such as “kindle fire”, “battery life”, “amazon fire”, “amazon echo” etc. We now combine the two columns back into a single bigram column for further analysis, using the ‘tidyr’ function ‘unite’. The purpose is to treat the bigram as a ‘term in a document’.
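The uniting step can be sketched as follows (a minimal sketch; the object name `bigrams_united` is our own choice):

```r
library(tidyr)

# Combine word1 and word2 back into a single space-separated bigram column
bigrams_united <- bigrams_filtered %>% 
  unite(bigram, word1, word2, sep = " ")
```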

The interesting thing is now to visualize the relationships between words. For this we use the igraph package. First we create a graph from a data frame of the bigrams. The nodes are the words, and the edges correspond to the connection between the two words in a bigram. The first word in the bigram is the ‘from’ column and the second is ‘to’, so it is a directed network. Each edge is given a weight ‘n’ corresponding to how many times the bigram occurs across all reviews. The weight is plotted as the alpha value, so more frequent bigrams have a darker colour, and vice versa. Only bigrams that occur more than 15 times are plotted, as the network would otherwise get too messy.

set.seed(123)
bigram_graph <- bigram_counts %>% 
  filter(n > 15) %>%  #The occurence of the bigram is more than 15. 
  graph_from_data_frame()

a<- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(bigram_graph, layout = "fr") + 
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.05,'inches'))+
  geom_node_point(color = "pink", size = 3) + 
  geom_node_text(aes(label = name),vjust=1,hjust=1) + 
  theme_void()

The plot above gives us some insight into the connections between words in the reviews. If we were to choose a random word in the graph, the most likely word to come after it would be the outgoing connection with the darkest colour. This way we can roughly predict which words come next. Remember that the words have been lemmatized, so the graph shows root words; a sentence built from them would not be grammatically correct, but would still carry the meaning as a whole.

We see many small connections such as customer -> service, sound -> quality, black -> friday etc. We also have a bigger cluster where love is one of the key words. Many words such as kid, daughter, son and wife point towards love, and the outgoing edges from love are play, watch and alexa, creating sentences such as “wife love alexa” or “kid love play”. So first we have the person, then the sentiment word love, and then the action they take or the thing they love. We see that amazon is a central word with many outgoing connections, as many things are called “amazon prime”, “amazon account” etc. Other key nodes are product names such as “fire”, “kindle” and “hue”.

2.1 Correlation

We can also look at correlation among words, which indicates how often they appear together relative to how often they appear separately. We use the phi coefficient, a measure of binary correlation: how often do both words appear in a document, or neither, versus one appearing without the other. This can be done with the pairwise_cor function from the widyr package. The calculation treats each review as a document.
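In terms of counts over documents, the phi coefficient for a word pair can be written as follows, where \(n_{11}\) is the number of reviews containing both words, \(n_{00}\) neither, \(n_{10}\) and \(n_{01}\) only one of them, and the dotted subscripts denote row and column totals:

\[\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}\]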

#Correlation bigrams
bigram_section <- tokens_clean %>% 
  filter(!word %in% stop_words$word)

word_pairs <- bigram_section %>% 
  pairwise_count(word, id, sort = TRUE)

word_cors <- bigram_section %>% 
  group_by(word) %>% 
  filter(n()>= 20) %>% 
  pairwise_cor(word,id,sort=TRUE)

word_cors

The words come in double pairs, “friday black” and “black friday”. These have a very high correlation of around 0.8, meaning they almost always occur in the same review texts. The pairs with the highest correlation might not be the most meaningful, such as “doorbell ring”, but some might show new insights. We show this in a network graph. As correlation is undirected, there are no arrows. The alpha of the edge shows the degree of correlation. The plot only shows pairs with a correlation above 0.275.

word_cors %>% 
  filter(correlation > .275) %>% 
  graph_from_data_frame() %>% 
  ggraph(layout="fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = TRUE) +
  geom_node_point(color = "pink",size=3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Compared to the previous bigram plot, the correlation plot is not nearly as connected. The highest number of connections for a node is 4, and the most frequent is 2. Words such as love, which were very important before, are now gone. This is because love occurs very often on its own and can be used together with many words, and therefore does not have a high correlation with any specific word. As before we also have connections such as “amazon prime” and “hue bulb”, words that almost always occur together.

3 NLP

In this section we will analyze the data using Natural Language Processing. We will gain insight into the dataset by extracting and identifying patterns. We start with some data preprocessing before the analysis. Then we do a sentiment analysis, where we analyze the words and whether they are positive or negative. Next we do an LSA analysis, where we apply dimensionality reduction and cluster the data to look for latent patterns among words and documents. Finally we do an LDA analysis to try to separate the data into topics and look for patterns among the words.

3.1 Data preprocessing

Up until now, special characters, numbers and special letters have been removed and the tokens have been unnested. The words have also been through lemmatization.

Before adding our own stopwords, we remove all stopwords built into the tidytext package (the stop_words dataset, which includes the SMART lexicon).

tokens_nlp <- tokens_clean %>% anti_join(stop_words)

After that we look through the tokens_clean dataframe again and remove our own stopwords, where we decided to remove five (“im”, “ive”, “dont”, “doesnt”, “didnt”).

own_stopwords <- tibble(word= c("im", "ive", "dont", "doesnt", "didnt"), 
                        lexicon = "OWN")
tokens_nlp <- tokens_clean %>% 
  anti_join(stop_words %>% bind_rows(own_stopwords), by = "word")

We will now look at the top words and plot them:

topwords <- tokens_nlp %>%
  count(word, sort=TRUE)

topwords %>%
  top_n(20, n) %>%
  ggplot(aes(x = word %>% fct_reorder(n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Counts", 
       x = "Top Words", 
       y = "Frequency")

We see that the individual tokens that are most frequent are sentiment words such as “love”, “easy” or product names such as “tablet”, “kindle” or the action they take “buy” or “read”.

3.2 TF-IDF

Up until now, equal weight has been given to all words, but some are rarer than others. Term frequency–inverse document frequency, or just tf-idf, is a way to analyze how important a word is to a document in a corpus:

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)\] Here tf is the term frequency and idf is the inverse document frequency, a coefficient which is larger when the term is found in fewer documents.

We tried to run a tf-idf analysis but could not really conclude anything from it, probably because there are a lot of documents. Every person has their own vocabulary, so many words appear very rarely and are therefore given a high idf coefficient, which is why their tf-idf is high. If we were analyzing a number of books, the analysis might have made more sense.
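For completeness, the attempt can be sketched with tidytext’s bind_tf_idf, treating each review as a document (a sketch only; the output is the part we found uninformative):

```r
library(dplyr)
library(tidytext)

# tf-idf of each word within each review (id = document)
review_tfidf <- tokens_nlp %>% 
  count(id, word) %>% 
  bind_tf_idf(word, id, n) %>% 
  arrange(desc(tf_idf))
```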

3.3 Sentiment analysis

Sentiment analysis refers to the use of text analysis to extract and identify subjective information, here whether the words are positive or negative. In this section we do two sentiment analyses, first identifying positive and negative words using the bing lexicon and then using the afinn lexicon.

3.3.1 General analysis

Before doing the sentiment analysis, we will quickly look at the distribution of the review ratings. First we run a summary.

summary(tokens_nlp$reviews.rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.533   5.000   5.000

Then we run a table.

table(tokens_nlp$reviews.rating)
## 
##     1     2     3     4     5 
##  1152   627  2210 13509 34746

Here we can see that there is an overrepresentation of positive reviews, with the mean at 4.533 and the median at 5.00. This will influence how we do the rest of the sentiment analysis. Note that the counts are per token, since the table is computed on the tokenized data: there are 1,152 tokens from one-star reviews, 627 from two-star, 2,210 from three-star, 13,509 from four-star and 34,746 from five-star reviews.

3.3.2 Bing

We will start with the Bing lexicon, which categorizes words in a binary fashion as positive or negative with no weighting. Here we use the function get_sentiments to fetch the lexicon and inner_join to join it with the tokenized data.

Now we are plotting a word count, grouped by sentiment, showing the 10 most frequent negative and positive words.

sentiment_bing <- tokens_nlp %>% inner_join(get_sentiments("bing"))

sentiment_analysis <- sentiment_bing %>% 
  filter(sentiment %in% c("positive", "negative"))

word_counts <- sentiment_analysis %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(title = "Sentiment Word Counts", x = "Words")

Here we can see that positive words are much more frequent than negative words. Among the positive words, “love” and “easy” are far more frequent than all others.

And now we will count all positive and negative words for each number of stars.

tokens_nlp %>% 
  inner_join(get_sentiments("bing")) %>%
  count(reviews.rating, sentiment)

From the table above we can see that reviews in all rating categories include both positive and negative words. Even in one-star reviews, almost a third of all sentiment words are positive. This might indicate that people use negated expressions such as “this is not very good”, where the only sentiment word is good, which classifies as positive even though the overall sentiment should be negative. This could be fixed with further analysis to remove the effect of negations, but is not within the time scope of this project. We also see that two-star reviews have almost equally many negative and positive sentiment words, resulting in an almost neutral sentiment.
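A sketch of how the negation effect could be inspected using the bigrams from the network section, counting sentiment words preceded by a negation word (a hypothetical check not run in this project; the negation_words list is our own):

```r
# Words that flip the meaning of the following sentiment word
negation_words <- c("not", "no", "never", "without")

bigrams_separated %>% 
  filter(word1 %in% negation_words) %>% 
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>% 
  count(word1, word2, sentiment, sort = TRUE)
```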

If we sum the sentiment words above, there are a total of 9,352 sentiment words in our data out of 52,244 words, meaning around 20% of the remaining words are sentiment words. Most words are thus still some form of stopword, neutral word or product word. The bing lexicon contains around 6,000 unique sentiment words, of which 345 appear in our data.

Now we will find the overall sentiment score for every review rating, taking the number of positive sentiment words and subtracting the negative. Then we divide by the number of tokens for each rating and plot the result.

tokens_nlp_bing <- tokens_nlp %>%
  inner_join(get_sentiments("bing")) %>%
  count(reviews.rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall_sentiment = positive - negative)

# Mean sentiment per rating: overall sentiment divided by the
# number of tokens with that rating (both are ordered by rating 1-5)
token_totals <- tokens_nlp %>% count(reviews.rating)
x <- tokens_nlp_bing$overall_sentiment / token_totals$n

ggplot(tokens_nlp_bing, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Mean of Sentiment Score for All Review Ratings",
       subtitle = "Reviews",
       x = "Review rating",
       y = "Mean Score of Sentiment")

The scores are generally very low because only 20% of the words in the reviews are sentiment words, and positive and negative sentiments average each other out. Taking all of this into account, a five-star review on average only has around 0.14 positive sentiment, where a value of 1 would mean one more positive sentiment word than negative. As we have 9,352 sentiment words and 5,000 reviews, there are on average only around 2 sentiment words per review. So it would be hard to predict the number of stars based on this bing sentiment alone, but it could assist in supervised ML.

3.3.3 Afinn

Now we will analyze the data using the afinn lexicon, which gives every word a score between -5 and 5, where 5 is very positive and -5 very negative. We again use get_sentiments to fetch the lexicon and inner_join to join it with the tokenized data. After this we sum the values for each review rating, divide by the number of tokens and plot the result.

# get_sentiments("afinn") # used to download the afinn lexicon
sentiment_afinn <- tokens_nlp %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(reviews.rating) %>%
  summarize(sentiment = sum(value))
  # no arrange(sentiment): rows must stay ordered by rating 1-5,
  # since they are indexed by rating position below

# Mean sentiment per rating: summed afinn score divided by the
# number of tokens with that rating
token_totals <- tokens_nlp %>% count(reviews.rating)
x <- sentiment_afinn$sentiment / token_totals$n

ggplot(sentiment_afinn, aes(x = reviews.rating, y = x, fill = as.factor(reviews.rating))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Mean of Sentiment Score for All Review Ratings",
       subtitle = "Reviews",
       x = "Review rating",
       y = "Mean Score of Sentiment")

What is interesting here is that all ratings except one star have a positive mean score. Of course this could also be because negated phrases are scored as positive, or vice versa, as discussed in the bing sentiment analysis.

3.4 LSA

Latent Semantic Analysis, or simply LSA, is a technique to identify and analyze co-occurrences of words across documents. Co-occurrence suggests that the words are somewhat correlated, either by being synonymous or by reflecting a shared concept, for example colours or cities. We want to extract meaning from documents and words, assuming that words that are close in meaning will appear in similar pieces of text.

First we create a sparse document-feature matrix from the corpus. Here we turn a tidy one-term-per-document-per-row data frame into a dfm (document-feature matrix) from the quanteda package.

#Document-feature-matrix
data_dfm <- tokens_nlp %>% 
  count(id, word) %>%
  cast_dfm(document = id, term = word, value = n)

Here we have a document-feature matrix with 3306 documents (reviews) and 3542 features (words).

Then we fit the LSA scaling model to the dfm, where we set nd, the number of dimensions, to 10.
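A minimal sketch of the fitting step (assuming the textmodel_lsa function from the quanteda.textmodels package, and that the fitted object is stored as data_dfm1, the name used below):

```r
library(quanteda.textmodels)

# Fit the LSA scaling model with 10 latent dimensions
data_dfm1 <- textmodel_lsa(data_dfm, nd = 10)
```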

3.4.1 Feature analysis

Now we pull out the features and change them to a data frame. We will start by looking at the features, i.e. the words.

data_lsa_loading <- data_dfm1$features %>%
  as.data.frame() %>%
  rownames_to_column(var = "word") %>% 
  as_tibble()

Now we can use the umap function, which stands for Uniform Manifold Approximation and Projection, a technique for dimensionality reduction. The function computes a manifold approximation and projects the data onto it.
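A sketch of the projection step (assuming the umap function from the uwot package; the parameter choice is ours):

```r
library(uwot)

set.seed(123)
# Project the 10-dimensional word loadings down to 2 dimensions
data_lsa_umap <- data_lsa_loading %>% 
  select(-word) %>% 
  umap(n_components = 2)
```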

And then we transform it into a dataframe.

data_lsa_umap %<>% as.data.frame()

We then transform it into a matrix and assign it a different name. The function hdbscan computes the hierarchical cluster tree, where minPts is the minimum size of a cluster. We are dealing with 3542 features, so we set the minimum to 300 points.

set.seed(123)
data_lsa_hdbscan <- data_lsa_umap %>% as.matrix() %>% hdbscan(minPts = 300)

Now we can plot the features in two dimensions. The points are coloured by the cluster assignment from hdbscan, which also assigns each point a membership probability within its cluster, running from 0 to 1.

set.seed(123)
x <- data_lsa_umap %>% 
  bind_cols(cluster = data_lsa_hdbscan$cluster %>% as.factor(), 
            prob = data_lsa_hdbscan$membership_prob) %>%
  ggplot(aes(x = V1, y = V2, col = cluster)) + 
  geom_point(aes(alpha = prob), shape = 21)

ggplotly(x)

Here, the function in R has reduced the number of dimensions in the data set using the latent features of the data.

And now we run a table to see how many features are in each cluster.

table(data_lsa_hdbscan$cluster)
## 
##    0 
## 3731

Here we can see the interactive plot for the features and how they cluster together.

Here there are two different clusters and a lot of outliers. One cluster is quite big with 1536 of the 3542 features. The other is smaller, and there are 91 outliers which do not belong to any cluster. There could be many more clusters, since each cluster only needs a minimum of 300 features, but the function only produces two. We will not go further into which words are in which clusters.

3.4.2 Document analysis

Now, we will move on to analyzing the reviews and how they cluster.

data_lsa_loading <- data_dfm1$docs %>%
  as.data.frame() %>%
  rownames_to_column(var = "id") %>% 
  as_tibble()

As before, we use the umap function to reduce the document loadings to two dimensions.

And then we transform it into a dataframe.

data_lsa_umap %<>% as.data.frame() 

We then transform it into a matrix and assign it a different name, and again run hdbscan with a minimum cluster size of 300 points.

set.seed(123)
data_lsa_hdbscan <- data_lsa_umap %>% as.matrix() %>% hdbscan(minPts = 300)

Now we can plot the documents in two dimensions. As before, the points are coloured by cluster and the alpha shows the membership probability within the cluster.

set.seed(123)
x <- data_lsa_umap %>% 
  bind_cols(cluster = data_lsa_hdbscan$cluster %>% as.factor(), 
            prob = data_lsa_hdbscan$membership_prob) %>%
  ggplot(aes(x = V1, y = V2, col = cluster)) + 
  geom_point(aes(alpha = prob), shape = 21)

ggplotly(x)

And now we run a table to see how many documents are in each cluster.

table(data_lsa_hdbscan$cluster)
## 
##    0    1    2    3    4    5 
## 1453 1045  444  308  944  798

The documents cluster differently from the features and are much more spread out, even with the same minimum points. Cluster 0 contains the 1453 noise points (outliers), leaving five clusters, which could correspond to the five different ratings, although the clustering is done on latent features.

3.5 LDA

Latent Dirichlet Allocation (LDA) is a topic model: it assumes each document is a mixture of topics and each topic is a distribution over words, and it is used to discover the topics that best describe a collection of documents.

First, we will prepare the data.

data_dtm <- tokens_nlp %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n, weighting = tm::weightTf)

data_dtm
## <<DocumentTermMatrix (documents: 4992, terms: 3731)>>
## Non-/sparse entries: 47101/18578051
## Sparsity           : 100%
## Maximal term length: 51
## Weighting          : term frequency (tf)

The matrix above is rather sparse (Sparsity = 100%). We can try to reduce this by deleting less often used terms.

data_dtm %>% removeSparseTerms(sparse = .99)
## <<DocumentTermMatrix (documents: 4992, terms: 180)>>
## Non-/sparse entries: 28798/869762
## Sparsity           : 97%
## Maximal term length: 13
## Weighting          : term frequency (tf)

The sparsity is now 97%, which is less than before but still rather sparse. However, the number of terms was reduced too much.

data_dtm %>% removeSparseTerms(sparse = .999)
## <<DocumentTermMatrix (documents: 4992, terms: 1116)>>
## Non-/sparse entries: 42893/5528179
## Sparsity           : 99%
## Maximal term length: 13
## Weighting          : term frequency (tf)

The sparsity is now 99%, which is higher than before, and the number of terms is still quite low compared to the ‘original’ data.

data_dtm %>% removeSparseTerms(sparse = .9999)
## <<DocumentTermMatrix (documents: 4992, terms: 3731)>>
## Non-/sparse entries: 47101/18578051
## Sparsity           : 100%
## Maximal term length: 51
## Weighting          : term frequency (tf)

The result above is exactly the same as before we tried to remove sparse terms. It does not seem worth trading the reduction in sparsity for the loss of terms, so we accept a high level of sparsity (100%) and keep all of the terms.

Next we perform the LDA. Beta is an output of the LDA model; it indicates the probability that a word occurs in a given topic.
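A minimal sketch of the model fit (assuming the LDA function from the topicmodels package and tidytext’s tidy to extract beta; k = 2 matches the discussion below):

```r
library(topicmodels)
library(tidytext)

# Fit a 2-topic LDA model on the document-term matrix
data_lda <- LDA(data_dtm, k = 2, control = list(seed = 123))

# Top 10 terms per topic by beta (per-topic word probability)
data_lda %>% 
  tidy(matrix = "beta") %>% 
  group_by(topic) %>% 
  top_n(10, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta)
```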

Above, the top 10 terms in each LDA topic are displayed. We choose two topics, since a higher number results in the same words being displayed in two or more topics. “love” is the word with the highest probability of occurring in topic 1, while “tablet” has the highest probability in topic 2.

It seems like topic 2 contains words with a technological character (echo, screen, app, alexa, device), while topic 1 seems related to books/reading (read, book, kindle) and positive words (love, easy).

4 Supervised machine learning

In this section we will analyze whether we can predict if a review is good, here defined as a rating of 4 or 5. We apply supervised machine learning and build a binary logistic model, where the response takes the value 1 for ratings 4 and 5 and 0 for ratings 1, 2 and 3.

We start by splitting data_clean into training and test data. Our test data consists of 25 percent (1,250 reviews) and our training data of 75 percent (3,750 reviews).

set.seed(123)
split = data_clean %>% select(id, reviews.rating) %>% initial_split()
train_data = training(split)
test_data = testing(split)

After that we transform our training data into a sparse matrix for the machine learning algorithm: a given cell is either empty or contains the frequency of the word in the given document.

set.seed(123)
sparse_words = tokens_clean %>% count(id, word) %>% inner_join(train_data) %>% cast_sparse(id, word, n)

dim(sparse_words)
## [1] 3750 3715

In our training set we have 3,750 reviews and 3,715 features, i.e. the different tokens. We could also cbind other columns into the matrix here, such as numeric sentiment scores.

We now build a dataframe with the response variable for the ratings.

set.seed(123)
word_rownames = as.integer(rownames(sparse_words))

# Recode ratings to a binary response: 1 for 4-5 stars, 0 for 1-3 stars
data_clean <- data_clean %>% 
  mutate(reviews.rating = if_else(reviews.rating >= 4, 1, 0))

joined = tibble(id = word_rownames) %>% 
  left_join(data_clean %>% select(id, reviews.rating))

Now we want to train our model using glmnet, which supports parallel processing and makes it feasible to train the model with cross-validation. We fit a binomial model without an intercept, where sparse_words is our x and an indicator for reviews.rating equal to 1 in joined is our y.

set.seed(123)
j = joined$reviews.rating == 1
model = cv.glmnet(sparse_words, j, family = "binomial",
                  keep = TRUE, intercept = FALSE)

The model is fitted; now we want to dig deeper into understanding it. First we investigate which predictors drive the model by checking the coefficients at lambda.1se, the largest lambda whose cross-validation error is within one standard error of the minimum.

set.seed(123)
coefs <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)

coefs %>%
  group_by(estimate > 0) %>%
  top_n(15, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    x = NULL,
    title = "Coefficients that increase/decrease probability the most"
  )

The graph above gives an overview of which coefficients increase or decrease the predicted probability of a good rating. We can see that ‘terrible’ and ‘defective’ have a negative effect on the probability that the rating is 4 or 5 stars, while ‘love’ and ‘great’ increase it.

Now we evaluate the model on the test data and investigate its confusion matrix. We classify each test review and count how many predictions were correct.

set.seed(123)
classifications = tokens_clean %>%
  inner_join(test_data) %>%
  inner_join(coefs, by = c("word" = "term")) %>%
  group_by(id) %>%
  summarize(score = sum(estimate)) %>%
  mutate(probability = plogis(score)) %>% 
  left_join(data_clean, by="id")

classifications

And here we compute the confusion matrix. The probability for each classification is the probability that the review has a 4 or 5 star rating, and we use 0.5 as the threshold, the midpoint between the two classes.

confusion_matrix = table(classifications$reviews.rating == 1, classifications$probability > 0.5)
confusion_matrix
##        
##         FALSE TRUE
##   FALSE     7   71
##   TRUE      5 1158

As the result shows, the model predicts 1,165 of the 1,241 classified test reviews correctly (1,158 true positives and 7 true negatives). Now we check the accuracy.

To check the accuracy, we use the package ‘yardstick’ to calculate it:

yardstick::accuracy(confusion_matrix)

As the result above shows, the model predicts about 93 percent of the test reviews correctly. 93 percent is pretty high, but the dataset is quite imbalanced, as the 4-5 star group is much larger than the 1-3 star group, so one can question whether the model would predict as well if the two groups were of equal size.

5 Conclusion

In this project we analyzed a dataset of 5,000 reviews from Amazon. First we performed a network analysis, which gave us insights into the connections between words in the reviews. Next we moved on to NLP, where we looked for patterns in the dataset. In the sentiment analysis, the Bing lexicon showed us which words were positive and negative, and the Afinn lexicon added weights to the positive and negative words.

Lastly we used machine learning to predict whether a review was rated 4 or 5 stars. We got an accuracy of 93%, which is pretty high, but since most reviews were given 4 or 5 stars, a high accuracy was to be expected.